36 research outputs found

    A novel dimensionality reduction technique based on independent component analysis for modeling microarray gene expression data

    DNA microarray experiments, which generate thousands of gene expression measurements, are being used to gather information from tissue and cell samples about gene expression differences that will be useful in diagnosing disease. One challenge of microarray studies is that the number n of samples collected is small relative to the number p of genes per sample, which is usually in the thousands. In statistical terms, this very large number of predictors compared to a small number of samples or observations makes the classification problem difficult. This is known as the "curse of dimensionality" problem. An efficient way to address this problem is to use dimensionality reduction techniques. Principal Component Analysis (PCA) is a leading method for dimensionality reduction of gene expression data and is optimal in the least-squares sense. In this paper we propose a new dimensionality reduction technique for specific bioinformatics applications based on Independent Component Analysis (ICA). Being able to exploit higher-order statistics to identify a linear model, this ICA-based dimensionality reduction technique outperforms PCA in both statistical and biological significance. We present experiments on the NCI 60 dataset to show this result.
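As a rough illustration of the PCA baseline mentioned in this abstract (not the ICA-based method the paper proposes), the sketch below reduces a samples-by-genes matrix with n much smaller than p using a thin SVD. The function name `pca_reduce`, the random data, and the sizes 60 x 2000 are assumptions made for the example; NumPy is assumed to be available.

```python
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X (samples x genes) onto the top-k principal components."""
    Xc = X - X.mean(axis=0)                      # center each gene (column)
    # Thin SVD: with n << p this avoids forming the p x p covariance matrix.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :k] * s[:k]                    # n x k coordinates of each sample
    explained = (s[:k] ** 2) / np.sum(s ** 2)    # fraction of variance per component
    return scores, Vt[:k], explained

# Hypothetical data: 60 samples, 2000 genes (random noise, for shape only).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2000))
scores, components, explained = pca_reduce(X, k=5)
print(scores.shape, components.shape)            # (60, 5) (5, 2000)
```

The returned component scores are mutually uncorrelated, which is the second-order property PCA enforces; ICA, as the abstract notes, goes further by using higher-order statistics.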

    A factor analysis model for functional genomics

    BACKGROUND: Expression array data are used to predict biological functions of uncharacterized genes by comparing their expression profiles to those of characterized genes. While biologically plausible, this is both statistically and computationally challenging. Typical approaches are computationally expensive and ignore correlations among expression profiles and functional categories. RESULTS: We propose a factor analysis model (FAM) for functional genomics and give a two-step algorithm, using genome-wide expression data for yeast and a subset of Gene Ontology Biological Process functional annotations. We show that the predictive performance of our method is comparable to the current best approach, while our total computation time was faster by a factor of 4000. We discuss the unique challenges in performance evaluation of algorithms used for genome-wide functional genomics. Finally, we discuss extensions to our method that can incorporate the inherent correlation structure of the functional categories to further improve predictive performance. CONCLUSION: Our factor analysis model is a computationally efficient technique for functional genomics and provides a clear and unified statistical framework with potential for incorporating important gene ontology information to improve predictions.

    The utility and predictive value of combinations of low penetrance genes for screening and risk prediction of colorectal cancer

    Despite the fact that colorectal cancer (CRC) is a highly treatable form of cancer if detected early, a very low proportion of the eligible population undergoes screening for this form of cancer. Integrating a genomic screening profile as a component of existing screening programs for CRC could potentially improve the effectiveness of population screening by allowing the assignment of individuals to different types and intensities of screening and also by potentially increasing the uptake of existing screening programs. We evaluated the utility and predictive value of genomic profiling as applied to CRC, and as a potential component of a population-based cancer screening program. We generated simulated data representing a typical North American population including a variety of genetic profiles, with a range of relative risks and prevalences for individual risk genes. We then used these data to estimate parameters characterizing the predictive value of a logistic regression model built on genetic markers for CRC. Meta-analyses of genetic associations with CRC were used in building science to inform the simulation work, and to select genetic variants to include in logistic regression model-building using data from the ARCTIC study in Ontario, which included 1,200 CRC cases and a similar number of cancer-free population-based controls. Our simulations demonstrate that, under reasonable assumptions involving modest relative risks for individual genetic variants, substantial predictive power can be achieved when risk variants are common (e.g., prevalence > 20%) and data for enough risk variants are available (e.g., ~140–160). Pilot work in population data shows modest but statistically significant predictive utility for a small collection of risk variants, smaller in effect than age and gender alone in predicting an individual's CRC risk. Further genotyping and many more samples will be required, and indeed the discovery of many more risk loci associated with CRC, before the question of the potential utility of germline genomic profiling can be definitively answered.
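The kind of simulation this abstract describes can be sketched in a few lines, under assumptions that are ours and not the authors': every variant has the same carrier prevalence (0.2) and the same odds ratio (1.2), variants are independent, the average-carrier risk is 0.1, and the true log-odds score is used directly instead of a fitted logistic regression. The function names `simulate` and `auc` are hypothetical. The sketch compares discrimination (AUC) for 10 versus 150 variants.

```python
import math
import random

def simulate(n_people, n_variants, prevalence, odds_ratio, average_risk, rng):
    """Return (log-odds scores, disease status) for a simulated population."""
    beta = math.log(odds_ratio)                       # per-variant log odds ratio
    intercept = math.log(average_risk / (1.0 - average_risk))
    expected_carried = n_variants * prevalence
    scores, status = [], []
    for _ in range(n_people):
        n_carried = sum(rng.random() < prevalence for _ in range(n_variants))
        # Centered so that a person with the expected variant count has average_risk.
        log_odds = intercept + beta * (n_carried - expected_carried)
        risk = 1.0 / (1.0 + math.exp(-log_odds))
        scores.append(log_odds)
        status.append(rng.random() < risk)
    return scores, status

def auc(scores, status):
    """Probability that a random case outranks a random control (ties count half)."""
    pairs = sorted(zip(scores, status))
    n_cases = sum(status)
    n_controls = len(status) - n_cases
    rank_sum_cases = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0                   # mean of ranks i+1 .. j
        rank_sum_cases += midrank * sum(s for _, s in pairs[i:j])
        i = j
    return (rank_sum_cases - n_cases * (n_cases + 1) / 2.0) / (n_cases * n_controls)

rng = random.Random(0)
for n_variants in (10, 150):
    scores, status = simulate(3000, n_variants, 0.2, 1.2, 0.1, rng)
    print(n_variants, "variants: AUC =", round(auc(scores, status), 3))
```

Under these assumptions the score variance grows with the number of variants, so discrimination improves as more common variants are included, which is the qualitative pattern the abstract reports.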

    Soft decision trees

    grantor: University of Toronto
    Soft Decision Trees (SDTs) are a new class of semi-parametric methods for classification and regression. They attempt to retain the features that made tree-like techniques widely popular (interpretability, graphical summary of the result, automatic variable selection and interaction detection, etc.) while improving their predictive performance and making the model more believable. This is done by employing "soft", or stochastic, splits, which result in blurred partition boundaries and a continuous prediction surface. The parameters are fitted via Maximum Likelihood, using the EM algorithm. Simulation experiments indicate that SDTs are indeed more powerful predictors. Real data analysis shows that SDTs can also aid in interpretation.
    M.Sc.
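To make the idea of a "soft" split concrete, here is a minimal sketch of a single split on a scalar input. It assumes a logistic gating function, which the abstract does not specify, and it does not fit any parameters (the thesis uses EM for that). The names `soft_split_predict`, `hard_split_predict`, and `softness` are hypothetical.

```python
import math

def soft_split_predict(x, threshold, softness, left_value, right_value):
    """Prediction of a one-split soft tree for a scalar input x.

    The observation goes right with probability sigmoid((x - threshold) / softness);
    the prediction is the probability-weighted mean of the two leaf values.
    """
    p_right = 1.0 / (1.0 + math.exp(-(x - threshold) / softness))
    return (1.0 - p_right) * left_value + p_right * right_value

def hard_split_predict(x, threshold, left_value, right_value):
    """Ordinary tree split: a step function with a jump at the threshold."""
    return right_value if x > threshold else left_value

for x in (-2.0, -0.1, 0.0, 0.1, 2.0):
    hard = hard_split_predict(x, 0.0, 1.0, 3.0)
    soft = soft_split_predict(x, 0.0, 0.5, 1.0, 3.0)
    print(x, hard, round(soft, 3))
```

The hard split jumps at the threshold, while the soft split varies continuously through it; shrinking `softness` toward zero recovers the hard split.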

    Statistical analysis of medical images with applications to neuroimaging

    grantor: University of Toronto
    We extend a classical multivariate technique, Linear Discriminant Analysis (LDA), and apply it in the analysis of PET and fMRI images of human brain function to discover regions of activation driven by the experimental stimuli. We re-examine and specialize some equivalences between LDA and both Canonical Correlation Analysis (CCA) and Multivariate ANOVA (MANOVA). Furthermore, efficient algorithms are derived to facilitate applying these multivariate models to extremely large image data. We deal with the ill-posed nature of the problem using spatial basis expansion and penalization (with the Penalized Discriminant Analysis (PDA) of Hastie et al. (1995)), and utilize efficient measures of predictive performance to optimize hyperparameters and validate the models in a robust fashion. We examine expanding the images into a 3D tensor-product B-spline and wavelet basis and compare to the results obtained without expansion. Some parallels between our proposal and some of those currently popular in the neuroimaging community are discussed. Another extension to PDA is derived and applied that allows one to model time series effects that exist in fMRI images. We conclude with many possible enhancements to the proposed paradigm.
    Ph.D.

    Reduced-Rank Multivariate Model for Time-Course Microarray Data

    Abstract: In this paper we present a novel, multi-gene approach to time-course microarray experiments. One advantage of our approach is its explicit modeling of the correlation structure among gene expression data. The proposed approach is computationally attractive. We apply the model to the well-known yeast cell-cycle microarray data and present results that compare favorably to those of previous studies.